Red Wine Data Exploration by Mario Hinojosa

Introduction:

The general approach to understanding the relationships in the dataset, particularly with respect to the quality of wine, is the following:

  1. Look at the basic structure of the dataset in terms of number of variables, data types, N.A. values and so on.
  2. Print basic statistics and histograms of the variables in the dataset to understand their distributions.
  3. Conduct bivariate analyses to understand the relationships between the different variables in the dataset.
  4. Based on the preliminary findings of the bivariate analyses, conduct multivariate analyses to further understand relationships in the dataset, with the aim of uncovering some insights into the factors behind the quality of wine.

So, the first step is to load the data and check the structure of the dataset.

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Right, there seems to be a need for some simple data manipulation (i.e. remove the index column X and convert quality to a factor).
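The two manipulations could look roughly like the sketch below. It is not necessarily the code behind this report: the data frame name rddf is taken from later in the report, and the inline rows are only the first ten observations from the structure printout (a subset of columns), used so the snippet runs on its own.

```r
# First ten rows from the str() printout (a subset of columns),
# standing in for the full dataset so the snippet is self-contained.
rddf <- data.frame(
  X = 1:10,
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Drop the row index and convert quality from integer to factor.
rddf$X <- NULL
rddf$quality <- factor(rddf$quality)

str(rddf)
```

With only ten rows the factor ends up with three levels; on the full dataset it would have the six levels shown in the printout that follows.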

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...

Univariate Plots Section

First step…let’s look at some basic statistics

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol      quality
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40   3: 10  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   4: 53  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20   5:681  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42   6:638  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   7:199  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90   8: 18

Ok, so right away we notice that a few variables have some values that can be seen as outliers. For example, residual.sugar has a 3rd quartile value of 2.6 while its max value is 15.5; that is almost 6 times the 3rd quartile.

We find a similar pattern with chlorides and total.sulfur.dioxide. We will be able to view this graphically with histograms, which is the next step.
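One common way to make "can be seen as outliers" concrete is the 1.5 × IQR rule of thumb. The sketch below is an illustration under that assumption, not a step taken in this report; upper_fence is a hypothetical helper name, and the vector holds only the first ten residual.sugar values from the structure printout.

```r
# First ten residual.sugar values from the str() printout (illustrative only).
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Upper fence of the 1.5 * IQR rule: Q3 + 1.5 * (Q3 - Q1).
upper_fence <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  q[2] + 1.5 * (q[2] - q[1])
}

fence <- upper_fence(residual_sugar)
flagged <- residual_sugar[residual_sugar > fence]
flagged

# The same rule applied to the quartiles reported in the summary table.
summary_fence <- 2.6 + 1.5 * (2.6 - 1.9)  # 3.65
max_to_q3 <- 15.5 / 2.6                   # roughly 5.96
```

Under this rule the reported maximum of 15.5 sits far above the fence of 3.65 obtained from the summary quartiles.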

There are way more wine datapoints with a quality level of 5 or 6 than with any other rating. Also, there are no wines rated below 3 and none above 8. Could this affect the quality of the derived statistics or the robustness of any model applied to this data?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

The variable sulphates seems to be right skewed, with a median value of 0.62. Let’s transform it using a log10 scale.
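A minimal base R sketch of the log10 view is shown below. The original plots are not preserved here, so this swaps in base graphics as an assumption, and the vector holds only the first ten sulphates values from the structure printout.

```r
# First ten sulphates values from the str() printout (illustrative only).
sulphates <- c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8)

# Transform first, then bin; plot = FALSE returns the histogram object
# without opening a graphics device.
log_sulphates <- log10(sulphates)
h <- hist(log_sulphates, plot = FALSE)

h$breaks  # bin edges on the log10 scale
h$counts  # number of observations per bin
```

Calling plot(h) draws the histogram; because the bins are on the log10 scale, the long upper tail is compressed.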

## max value is 10.99347 s.d. away

Residual sugar also seems to be right skewed, with its maximum reported as almost 11 standard deviations out. Again, let’s plot it using a log transformation.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

pH seems to be more normally distributed.

## max value is 13.98185 s.d. away

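Alcohol is also right skewed. The printed figure puts its max value almost 14 s.d. out, but with a mean of 10.42 and a max of 14.90 that would require a standard deviation of roughly 0.32, which is hard to reconcile with an interquartile range of 1.6 (9.5 to 11.1). The figure looks more like the maximum divided by the standard deviation than a distance from the mean, so it should be read with caution. Let’s plot it using a log transformation.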
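For reference, the distance of a maximum from the mean in standard deviations is usually computed as (max - mean) / sd. The sketch below puts that next to max / sd for comparison. The helper names max_z and max_over_sd are hypothetical, and the vector holds only the first ten alcohol values from the structure printout, so the numbers it prints are not the dataset's.

```r
# First ten alcohol values from the str() printout (illustrative only).
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

# Distance of the maximum from the mean, in standard deviations.
max_z <- function(x) (max(x) - mean(x)) / sd(x)

# Maximum divided by the standard deviation; this is not a distance
# from the mean and is much larger whenever the mean is far from zero.
max_over_sd <- function(x) max(x) / sd(x)

cat("max value is", max_z(alcohol), "s.d. away from the mean\n")
cat("max / sd is", max_over_sd(alcohol), "\n")
```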

##  [1] 215  59  44  55  65  42  39  28  38  46  60  76  65  39  51  62  49
## [18]  33  33  57  45  38  41  41  88  30  27  20  18  17   3  19  21  13
## [35]   6   2   7   4   1   1   0   0   0   0   0   0   0   0   0   1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

Hm… citric acid has an odd shape. It seems most of the observations have a value very close to zero, followed by values around 0.5. There also seems to be an outlier observation.

Log-transforming it doesn’t seem to add much information either.
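Bin counts like the ones printed above for citric acid can be obtained without drawing anything. The sketch below is one way to do it, assuming 0.02-wide bins from 0 to 1 (which gives 50 bins); how the original counts were produced is not shown in this report, and the vector holds only the first ten citric.acid values from the structure printout.

```r
# First ten citric.acid values from the str() printout (illustrative only).
citric_acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# 51 break points from 0 to 1 give 50 bins of width 0.02.
breaks <- seq(0, 1, by = 0.02)
counts <- hist(citric_acid, breaks = breaks, plot = FALSE)$counts

counts
summary(citric_acid)
```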

Univariate Analysis

What is the structure of your dataset?

  • The red wine dataset (rddf) consists of 12 variables, of which 11 are numerical and measure different chemical properties of the red wine. There is a total of 1599 observations.
  • quality is a factor variable which represents the quality of the wine graded on a scale between 0 (very bad) and 10 (very excellent). The majority of the observations received a grade of 5 or 6.
  • None of the wines were graded very bad or very good (below 3 or above 8).
  • Several variables have outlier values in the upper range, such as volatile.acidity, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide and sulphates.

What is/are the main feature(s) of interest in your dataset?

After reading the dataset documentation, plotting some variables and from previous, albeit limited, personal knowledge about wine, I suspect that sulphates, volatile acidity and residual sugar are the main features that influence the quality of the wine.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Probably pH, citric acid and fixed acidity will also be relevant to study in more detail in the following sections.

Did you create any new variables from existing variables in the dataset?

At this point, given the basic exploration, I didn’t see the need to create any new variables.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

  • There seemed to be a number of variables that had outliers, thus skewing their distributions. Consequently, I log-transformed these right skewed variables to understand a bit better how the data is distributed.
  • Of these skewed variables, perhaps the one that caught my attention the most was citric.acid as it seems to have two bins with very high count (at 0.0 to 0.02 and 0.46 to 0.48), thus suggesting some sort of bimodality.
  • I removed one variable X because it was simply an index of the observations.
  • Also, I transformed the quality variable from numeric to a factor with the correct order. This was necessary as this is inherently a categorical and discrete variable.

Bivariate Plots Section

Let’s plot all variables against each other and print a numeric correlation matrix to get a clearer picture of their relationships.
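A correlation matrix over the numeric columns can be sketched as follows. This is an assumption about the approach rather than the report's own code, and it uses only the first ten rows of three columns from the structure printout, so its values will differ from the full matrix printed next.

```r
# First ten rows from the str() printout (illustrative subset of columns).
rddf <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  quality = factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5))
)

# quality is a factor, so keep only the numeric columns for cor().
numeric_cols <- sapply(rddf, is.numeric)
cor_matrix <- cor(rddf[, numeric_cols])

round(cor_matrix, 2)
```

Base R's pairs(rddf[, numeric_cols]) draws the matching scatterplot matrix.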

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity           1.00000000     -0.256130895  0.67170343
## volatile.acidity       -0.25613089      1.000000000 -0.55249568
## citric.acid             0.67170343     -0.552495685  1.00000000
## residual.sugar          0.11477672      0.001917882  0.14357716
## chlorides               0.09370519      0.061297772  0.20382291
## free.sulfur.dioxide    -0.15379419     -0.010503827 -0.06097813
## total.sulfur.dioxide   -0.11318144      0.076470005  0.03553302
## density                 0.66804729      0.022026232  0.36494718
## pH                     -0.68297819      0.234937294 -0.54190414
## sulphates               0.18300566     -0.260986685  0.31277004
## alcohol                -0.06166827     -0.202288027  0.10990325
##                      residual.sugar    chlorides free.sulfur.dioxide
## fixed.acidity           0.114776724  0.093705186        -0.153794193
## volatile.acidity        0.001917882  0.061297772        -0.010503827
## citric.acid             0.143577162  0.203822914        -0.060978129
## residual.sugar          1.000000000  0.055609535         0.187048995
## chlorides               0.055609535  1.000000000         0.005562147
## free.sulfur.dioxide     0.187048995  0.005562147         1.000000000
## total.sulfur.dioxide    0.203027882  0.047400468         0.667666450
## density                 0.355283371  0.200632327        -0.021945831
## pH                     -0.085652422 -0.265026131         0.070377499
## sulphates               0.005527121  0.371260481         0.051657572
## alcohol                 0.042075437 -0.221140545        -0.069408354
##                      total.sulfur.dioxide     density          pH
## fixed.acidity                 -0.11318144  0.66804729 -0.68297819
## volatile.acidity               0.07647000  0.02202623  0.23493729
## citric.acid                    0.03553302  0.36494718 -0.54190414
## residual.sugar                 0.20302788  0.35528337 -0.08565242
## chlorides                      0.04740047  0.20063233 -0.26502613
## free.sulfur.dioxide            0.66766645 -0.02194583  0.07037750
## total.sulfur.dioxide           1.00000000  0.07126948 -0.06649456
## density                        0.07126948  1.00000000 -0.34169933
## pH                            -0.06649456 -0.34169933  1.00000000
## sulphates                      0.04294684  0.14850641 -0.19664760
## alcohol                       -0.20565394 -0.49617977  0.20563251
##                         sulphates     alcohol
## fixed.acidity         0.183005664 -0.06166827
## volatile.acidity     -0.260986685 -0.20228803
## citric.acid           0.312770044  0.10990325
## residual.sugar        0.005527121  0.04207544
## chlorides             0.371260481 -0.22114054
## free.sulfur.dioxide   0.051657572 -0.06940835
## total.sulfur.dioxide  0.042946836 -0.20565394
## density               0.148506412 -0.49617977
## pH                   -0.196647602  0.20563251
## sulphates             1.000000000  0.09359475
## alcohol               0.093594750  1.00000000

Interesting… looking at the correlation matrix we see that density and sulphates seem to be more strongly correlated with other variables, particularly fixed acidity, citric acid and chlorides. In particular, density is highly correlated with fixed acidity (0.67). Furthermore, the pairs plot shows that wines with higher quality tend to have a higher median alcohol and citric acid. In comparison, higher quality wines have lower median values for volatile acidity and pH.

I want to see the aforementioned relationships with respect to quality in individual plots to have a better sense of their magnitude.
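The per-grade medians that follow have the Group.1 / x column layout that aggregate() produces, so a call along these lines would give the same shape of table. The data below are only the first ten rows from the structure printout, so the medians differ from the ones reported for the full dataset.

```r
# First ten rows from the str() printout (illustrative only).
rddf <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality = factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5))
)

# Median of volatile.acidity within each quality grade.
medians <- aggregate(rddf$volatile.acidity,
                     by = list(rddf$quality),
                     FUN = median)
medians
```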

##   Group.1     x
## 1       3 0.845
## 2       4 0.670
## 3       5 0.580
## 4       6 0.490
## 5       7 0.370
## 6       8 0.370

The median value of volatile acidity drops to 0.37 in higher quality wines

##   Group.1     x
## 1       3 0.545
## 2       4 0.560
## 3       5 0.580
## 4       6 0.640
## 5       7 0.740
## 6       8 0.740

The median value of sulphates increases to 0.74 in higher quality wines

##   Group.1    x
## 1       3 3.39
## 2       4 3.37
## 3       5 3.30
## 4       6 3.32
## 5       7 3.28
## 6       8 3.23

In the case of pH, the median value does decrease with quality, but less markedly than for the previously examined variables.

##   Group.1      x
## 1       3  9.925
## 2       4 10.000
## 3       5  9.700
## 4       6 10.500
## 5       7 11.500
## 6       8 12.150

Similar to sulphates, the median value of alcohol increases with the quality of wines, at least from grade 5 upwards (grade 5 actually has the lowest median, 9.7). It must also be noted that quality 5 has a number of observations with high alcohol content.

##   Group.1     x
## 1       3 0.035
## 2       4 0.090
## 3       5 0.230
## 4       6 0.260
## 5       7 0.400
## 6       8 0.420

The amount of citric acid and the quality of wine seem to have a noticeable positive relationship, going from 0.035 in grade-3 wine all the way up to 0.42.

Now I want to see in more detail the relationships between some features (i.e. not with respect to quality, the dependent variable) which I previously noticed had a strong correlation. More specifically, I want to explore in scatter plots the relationship of fixed acidity with density and with pH, as well as that of citric acid with pH.

## [1] 0.6680473

## [1] -0.6829782

## [1] -0.5419041

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Some interesting relationships were discovered using bivariate analyses. First of all, I previously thought alcohol was not related to the quality of wine. However, plotting quality vs alcohol strongly suggests that the two are positively related. That is, wines with higher quality tend to have a higher median alcohol. Having mentioned this, it is also worth noting that quite a number of observations had a high level of alcohol but not a high quality ranking. Furthermore, residual sugar, which I previously thought was very relevant, doesn’t seem to have a strong relationship with the quality of wine.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

What I also noticed was that higher quality wines have lower median values for volatile acidity and pH

What was the strongest relationship you found?

Given that the dependent variable (quality of wine) is categorical, I couldn’t get a correlation number. However, the boxplots suggest that both volatile acidity and pH have a negative relationship with quality (pH less markedly so). That is, the quality level increases as the levels of pH and volatile acidity decrease. In comparison, alcohol and citric acid show a strong positive relationship with the quality of wine.
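One option that was not used in this analysis is to treat the quality grades as ordered ranks and compute a Spearman rank correlation. The sketch below shows the mechanics under that assumption, using only the first ten rows from the structure printout; with so few rows and many ties the resulting number is purely illustrative.

```r
# First ten rows from the str() printout (illustrative only).
quality <- factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5))
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

# Recover the numeric grade from the factor labels (not the level codes).
quality_num <- as.integer(as.character(quality))

# Spearman correlation only uses the ordering of the grades,
# not the spacing between them.
rho <- cor(quality_num, alcohol, method = "spearman")
rho
```

Going through as.character() matters: as.integer(quality) alone would return the level codes 1, 2, 3 rather than the grades 5, 6, 7.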

Multivariate Plots Section

Let’s now plot the variables we just saw but in a multivariate way. First I want to examine density plots (colored by quality level) of variables examined in the previous boxplots. Then, I want to re-examine the relationship between fixed acidity, pH, sulphates and citric acid but this time adding quality level as color.

The density plots support the relationship between quality and the examined variables as shown in the boxplots from before. For example we see that alcohol distribution for high quality wines is shifted right. Also, at quality level 5 we notice a hump towards value 13% which also showed up in the boxplots.

The scatter plots colored by quality level don’t seem to reveal much info. However, the volatile acidity vs citric acid plot shows an interesting pattern: most of grade-7 wines have a volatile acidity below 0.4 and citric acid range of 0.25 - 0.75.
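The quadrant observation can be turned into a number by computing, per grade, the share of wines with volatile acidity below 0.4 and citric acid between 0.25 and 0.75. The sketch below is an illustration of that calculation, not a result from this report: it uses only the first ten rows from the structure printout, which are far too few to say anything about the dataset.

```r
# First ten rows from the str() printout (illustrative only).
rddf <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  quality = factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5))
)

# TRUE for wines inside the quadrant described in the text.
in_quadrant <- with(rddf,
  volatile.acidity < 0.4 & citric.acid >= 0.25 & citric.acid <= 0.75)

# Share of wines inside the quadrant, per quality grade.
share <- tapply(in_quadrant, rddf$quality, mean)
share
```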

By now it seems clearer that citric acid is quite relevant to the quality of wine. Now I’m going to plot it against a few variables, but this time I will facet the plots by quality level to see if there is another interesting finding.

I don’t see anything unusual or additional information to what the other plots have shown.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Conducting some multivariate analyses of the main features identified seems to point in the same direction as what was identified previously. That is, the distribution of good wines (7 and 8 rating) with respect to the selected features (see density plots) seems to stand out from that of lower grade wines. For instance, the distribution of citric acid for good quality wine is more left skewed than the lower grade wine distributions. A similar behaviour is seen with the sulphates density distribution when grouped by quality.

Were there any interesting or surprising interactions between features?

  • The citric acid vs. volatile acidity plot shows a clearer pattern with respect to the quality of wine. In other words, most of the observations for good quality wines (i.e. level 7) are located in a quadrant with a volatile acidity below 0.4 and a citric acid range of 0.25 - 0.75.
  • Looking at scatterplots faceted by quality shows once again that there are a lot of observations with a quality level of 5 and 6. However, when fitting a linear regression we see that the relationship is very similar in each facet.
  • Most of the relationships amongst features seem quite intuitive. For example, the different types of acidity are negatively related to pH since, by definition, pH measures the level of acidity.
  • The distribution of citric acid for quality of 5 and 6 seems to be trimodal at very similar levels (close to 0.0, 0.25 and 0.50 g / dm^3)
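The "very similar in each facet" observation can be checked numerically by fitting the same linear regression within each quality grade and comparing slopes. The sketch below uses simulated data, since a few rows per grade are not enough to fit anything: the data frame toy, the slope of -0.4 and the noise level are all made-up assumptions, chosen only to mimic the negative citric acid vs pH relationship.

```r
set.seed(1)

# Simulated stand-in data: 30 wines in each of three quality grades.
toy <- data.frame(
  quality = factor(rep(c("5", "6", "7"), each = 30)),
  citric.acid = runif(90, min = 0, max = 0.75)
)
# Made-up negative relationship between citric acid and pH plus noise.
toy$pH <- 3.5 - 0.4 * toy$citric.acid + rnorm(90, sd = 0.1)

# Fit pH ~ citric.acid separately within each grade and keep the slope.
slopes <- sapply(split(toy, toy$quality), function(d) {
  coef(lm(pH ~ citric.acid, data = d))[["citric.acid"]]
})
slopes
```

Slopes that are close to each other across grades correspond to regression lines that look alike in every facet.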

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

The objective of this exploratory analysis is to understand which features affect the quality of wine. In the course we learnt how to fit a linear regression on a numerical dependent variable. However, this time the dependent variable is categorical in nature, and unfortunately my knowledge of how to conduct this sort of model is extremely limited, hence I didn’t create a model.


Final Plots and Summary

Plot One

Description One

Citric acid median is considerably higher for good quality wines (7 and 8), going from 0.035 g/dm^3 for low quality wines all the way up to 0.42 g/dm^3 which equates to 12x the concentration.

Plot Two

Description Two

Alcohol median is notably higher for good quality wines, that is, grades 7 and 8, with the latter having a median above 12%. However, there are some seemingly outlier wines that have high alcohol content but are graded as average (5 - 6). The most noticeable jump in median alcohol concentration is between levels 6 and 7, which represents a 1 percentage point increase in the median.

Plot Three

Description Three

Volatile acidity (i.e. acetic acid) median is lower for good quality wines (grade 7 and 8), both having a median of 0.37 g/dm^3 (less than half compared to the lowest quality wines). However, there are some outlier wines that have high acetic acid content and yet are graded as average (5 - 6).


Reflection

The red wine dataset consists of 1599 observations of red variants of the Portuguese “Vinho Verde” wine.

The exploratory process began by understanding individual variables via summary statistics and histograms. This gave me a sense of the distribution of each individual variable. What really got me going and asking questions, though, was plotting the pairs plot and correlation matrix, as I was able to quickly see which variables seemed to have a closer relationship with the quality of wine, such as alcohol, sulphates, volatile acidity and especially citric acid. The following steps consisted of plotting in more detail the features I found interesting and trying to see if there was a pattern.

From the analyses conducted, and with my limited domain knowledge, I have noticed that three variables have a very close relationship with the quality of wines. Those variables are: citric acid, acetic acid and alcohol. These findings surface an interesting observation which is basically that not all acids are made equal, as some seem to be linked to higher quality wines (i.e. citric acid) and others seem to have a negative impact (e.g. acetic acid).

During my exploration I noticed that the number of observations varied widely from one quality level to another (5 and 6 had significantly more observations). I believe this could be a limitation of the data as, ideally, one would want enough data points at each quality level to make the statistical methods more robust. Moreover, I find it quite strange that the dataset didn’t have any observations with quality levels below 3 or above 8. I have a feeling that this could also affect the robustness of statistical methods applied to this dataset.

My main struggle was not being able to fit a linear regression with quality as the dependent variable. This was due to the fact that at this point I don’t know how to apply a similar method when the dependent variable is categorical. Thinking about how to expand my analysis, I believe that understanding how probit or logit regressions work would allow me to fit a model and obtain some numbers on the relationships I identified via plots.
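As a pointer for that next step, an ordered logit can be fitted in R with polr() from the MASS package. The sketch below is only an illustration on simulated data and not a model of the wine dataset: the data frame toy, the centred predictor alcohol_c, the effect size of 1.5 and the cut points are all made-up assumptions.

```r
library(MASS)  # provides polr() for ordered logit / probit models

set.seed(42)
n <- 300

# Simulated, centred predictor and a latent score with logistic noise.
alcohol_c <- rnorm(n, mean = 0, sd = 1)
latent <- 1.5 * alcohol_c + rlogis(n)

# Cut the latent score into three ordered grades.
toy <- data.frame(
  alcohol_c = alcohol_c,
  quality = cut(latent, breaks = c(-Inf, -1, 1, Inf),
                labels = c("5", "6", "7"), ordered_result = TRUE)
)

# Ordered logit; Hess = TRUE keeps the Hessian so summary() can
# report standard errors.
fit <- polr(quality ~ alcohol_c, data = toy, Hess = TRUE)
coef(fit)
```

A positive coefficient means higher values of the predictor shift wines towards higher grades; passing method = "probit" to polr() fits the probit variant instead.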